Prosody Based Endpoint Detection



FIELD OF THE INVENTION 

The present invention pertains to endpoint detection in the processing of
speech, such as in speech recognition. More particularly, the present invention
relates to the detection of the endpoint of an utterance using prosody.

BACKGROUND OF THE INVENTION

In a speech recognition system, a device commonly known as an "endpoint
detector" separates the speech segment(s) of an utterance represented in an input
signal from the non-speech segments, i.e., it identifies the "endpoints" of speech.
An "endpoint" of speech can be either the beginning of speech after a period of
non-speech or the ending of speech before a period of non-speech. An endpoint
detector may be either hardware-based or software-based, or both. Because
endpoint detection generally occurs early in the speech recognition process, the
accuracy of the endpoint detector is crucial to the performance of the overall
speech recognition system. Accurate endpoint detection will facilitate accurate
recognition results, while poor endpoint detection will often cause poor
recognition results.

Some conventional endpoint detectors operate using log energy and/or
spectral information as knowledge sources. For example, by comparing the log
energy of the input speech signal against a threshold energy level, an endpoint can
be identified. An end-of-utterance can be identified, for example, if the log energy
drops below the threshold level after having exceeded the threshold level for some
specified length of time. However, this approach does not take into consideration
many of the characteristics of human speech. As a result, this approach is only a
rough approximation, such that purely energy-based endpoint detectors are not as
accurate as desired.
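
By way of illustration only, the purely energy-based approach described above can be sketched as follows. The function names, the frame representation (a list of samples per frame), and the parameter values are hypothetical and are not taken from any particular conventional detector.

```python
import math

def log_energy(frame):
    """Log of the mean squared amplitude of one frame (a small floor avoids log(0))."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)

def energy_end_of_utterance(frames, threshold, min_speech_frames):
    """Return the index of the first frame whose log energy drops below
    `threshold` after at least `min_speech_frames` consecutive frames at or
    above it, or None if no such frame exists."""
    run = 0  # consecutive frames at or above the threshold
    for i, frame in enumerate(frames):
        if log_energy(frame) >= threshold:
            run += 1
        else:
            if run >= min_speech_frames:
                return i
            run = 0
    return None
```

Such a detector reacts identically to a mid-utterance pause and to a true end of utterance, which is the limitation noted above.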

One problem associated with endpoint detection is distinguishing between
a mid-utterance pause and the end of an utterance. In making this determination,
there is generally an inherent trade-off between achieving short latency and
detecting the entire utterance.

SUMMARY OF THE INVENTION

A method and apparatus for performing endpoint detection are provided.
In the method, a speech signal representing an utterance is input. The utterance
has an intonation, based on which the endpoint of the utterance is identified. In
particular embodiments, endpoint identification may include referencing the
intonation of the utterance against an intonation model.

Other features of the present invention will be apparent from the
accompanying drawings and from the detailed description which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in
the figures of the accompanying drawings, in which like references indicate
similar elements and in which:

Figure 1 is a block diagram of a speech recognition system;

Figure 2 is a block diagram of a processing system that may be configured
to perform speech recognition;

Figure 3 is a flow diagram showing an overall process for performing
endpoint detection using prosody;

Figure 4 is a flow diagram showing in greater detail the process of Figure 3,
according to one embodiment; and

Figures 5A and 5B are flow diagrams showing in greater detail the process
of Figure 3, according to a second embodiment.

DETAILED DESCRIPTION

A method and apparatus for detecting endpoints of speech using prosody
are described. Note that in this description, references to "one embodiment" or
"an embodiment" mean that the feature being referred to is included in at least
one embodiment of the present invention. Further, separate references to "one
embodiment" in this description do not necessarily refer to the same embodiment;
however, neither are such embodiments mutually exclusive, unless so stated and
except as will be readily apparent to those skilled in the art.

As described in greater detail below, an end-of-utterance condition can be
identified by an endpoint detector based, at least in part, on the prosody
characteristics of the utterance. Other knowledge sources, such as log energy
and/or spectral information, may also be used in combination with prosody. Note
that while endpoint detection generally involves identifying both
beginning-of-utterance and end-of-utterance conditions (i.e., separating speech
from non-speech), the techniques described herein are directed primarily toward
identifying an end-of-utterance condition. Any conventional endpointing technique
may be used to identify a beginning-of-utterance condition, which technique(s)
need not be described herein. Nonetheless, it is contemplated that the
prosody-based techniques described herein may be extended or modified to detect
a beginning-of-utterance condition as well. The processes described herein are
real-time processes that operate on a continuous audio signal, examining the
incoming speech frame-by-frame to detect an end-of-utterance condition.

"Prosody" is defined herein to include characteristics such as intonation
and syllable duration. Hence, an end-of-utterance condition may be identified
based, at least in part, on the intonation of the utterance, the duration of one or
more syllables of the utterance, or a combination of these and/or other variables.
For example, in many languages, including English, the end of an utterance often
has a generally decreasing intonation. This fact can be used to advantage in
endpoint detection, as further described below. Various types of prosody models
may be used in this process. This prosody based approach, therefore, makes use
of more of the inherent features of human speech than purely energy-based
approaches and other more traditional approaches. Among other advantages, the
use of intonation in the endpoint detection process helps to more accurately
distinguish between a mid-utterance pause and an end-of-utterance condition,
without adversely affecting latency. Consequently, the prosody based approach
provides more accurate endpoint detection without adversely affecting latency
and thereby facilitates improved speech recognition.

Figure 1 shows an example of a speech recognition system in which the
present endpoint detection technique can be implemented. The illustrated system
includes a dictionary 2, a set of acoustic models 4, and a grammar/language
model 6. Each of these elements may be stored in one or more conventional
storage devices. The dictionary 2 contains all of the words allowed by the speech
application in which the system is used. The acoustic models 4 are statistical
representations of all phonetic units and subunits of speech that may be found in a
speech waveform. The grammar/language model 6 is a statistical or deterministic
representation of all possible combinations of word sequences that are allowed by
the speech application. The system further includes an audio front end 7 and a
speech decoder 8. The audio front end 7 includes an endpoint detector 5. The
endpoint detector 5 has access to one or more prosody models 3-1 through 3-N,
which are discussed further below.

An input speech signal is received by the audio front end 7 via a
microphone, telephony interface, computer network interface, or any other
suitable input interface. The audio front end 7 digitizes the speech waveform (if
not already digitized), endpoints the speech (using the endpoint detector 5), and
extracts feature vectors (also known as features, observations, parameter vectors,
or frames) from the digitized speech. In some implementations, endpointing
precedes feature extraction, while in other implementations feature extraction may
precede endpointing. To facilitate description, the former case is assumed
henceforth in this description.

Thus, the audio front end 7 is essentially responsible for processing the
speech waveform and transforming it into a sequence of data points that can be
better modeled by the acoustic models 4 than the raw waveform. The extracted
feature vectors are provided to the speech decoder 8, which references the feature
vectors against the dictionary 2, the acoustic models 4, and the
grammar/language model 6, to generate recognized speech data. The recognized
speech data may further be provided to a natural language interpreter (not
shown), which interprets the meaning of the recognized speech.

The prosody based endpoint detection technique is implemented within the
endpoint detector 5 in the audio front end 7. Note that audio front ends which
perform the above functions but without a prosody based endpoint detection
technique are well known in the art. The prosody based endpoint detection
technique may be implemented using software, hardware, or a combination of
hardware and software. For example, the technique may be implemented by a
microprocessor or Digital Signal Processor (DSP) executing sequences of software
instructions. Alternatively, the technique may be implemented using only
hardwired circuitry, or a combination of hardwired circuitry and executing
software instructions. Such hardwired circuitry may include, for example, one or
more microcontrollers, Application Specific Integrated Circuits (ASICs),
Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs),
A/D converters, and/or other suitable components.

The system of Figure 1 may be implemented in a conventional processing
system, such as a personal computer (PC), workstation, hand-held computer,
Personal Digital Assistant (PDA), etc. Alternatively, the system may be
distributed between two or more such processing systems, which may be
connected on a network. Figure 2 is a high-level block diagram of an example of
such a processing system. The processing system of Figure 2 includes a central
processing unit (CPU) 10 (e.g., a microprocessor), random access memory (RAM)
11, read-only memory (ROM) 12, and a mass storage device 13, each connected to
a bus system 9. Mass storage device 13 may include any suitable device for
storing large volumes of data, such as magnetic disk or tape, magneto-optical
(MO) storage device, or any of various types of Digital Versatile Disk (DVD) or
compact disk (CD) based storage, flash memory, etc. The bus system 9 may
include one or more buses connected to each other through various bridges,
controllers and/or adapters, such as are well-known in the art. For example, the
bus system 9 may include a system bus that is connected through an adapter to
one or more expansion buses, such as a Peripheral Component Interconnect (PCI)
bus.

Also coupled to the bus system 9 are an audio interface 14, a display device
15, input devices 16 and 17, and a communication device 18. The audio interface
14 allows the computer system to receive an input audio signal that includes the
speech signal. The audio interface 14 includes circuitry and (in some
embodiments) software instructions for receiving an input audio signal which
includes the speech signal, which may be received from a microphone, a telephone
line, a network interface, etc., and for transferring such signal onto the bus system
9. Thus, prosody based endpoint detection as described herein may be performed
within the audio interface 14. Alternatively, the endpoint detection may be
performed within the CPU 10, or partly within the CPU 10 and partly within the
audio interface 14. The audio interface may include one or more DSPs,
general-purpose microprocessors, microcontrollers, ASICs, PLDs, FPGAs, A/D
converters, and/or other suitable components.

The display device 15 may be any suitable device for displaying
alphanumeric, graphical and/or video data to a user, such as a cathode ray tube
(CRT), a liquid crystal display (LCD), or the like, and associated controllers. The
input devices 16 and 17 may include, for example, a conventional pointing device,
a keyboard, etc. The communication device 18 may be any device suitable for
enabling the computer system to communicate data with another processing
system over a network via a data link 20, such as a conventional telephone
modem, a wireless modem, a cable modem, an Integrated Services Digital
Network (ISDN) adapter, a Digital Subscriber Line (DSL) modem, an Ethernet
adapter, or the like.

Note that some of these components may be omitted in certain
embodiments, and certain embodiments may include additional or substitute
components that are not mentioned here. Such variations will be readily apparent
to those skilled in the art. As an example of such a variation, the functions of the
audio interface 14 and the communication device 18 may be provided in a single
device. As another example, the peripheral components connected to the bus
system 9 might further include audio speakers and associated adapter circuitry.
As yet another example, the display device 15 may be omitted if the processing
system has no direct interface to a user.

Prosody based endpoint detection may be based, at least in part, on the 
intonation of utterances. Of course, endpoint detection may also be based on 
other prosodic information and/or on non-prosodic information, such as log 
energy. 

Figure 3 shows, at a high level, a process for detecting an end-of-utterance
condition based on prosody, according to one embodiment. The next frame of
speech representing at least part of an utterance is initially input to the endpoint
detector 5 at 301. The end-of-utterance condition is identified at 302 based (at
least) on the intonation of the utterance, and the routine then repeats. Note that
this process and the processes described below are real-time processes that operate
on a continuous audio signal, examining the incoming speech frame-by-frame to
detect an end-of-utterance condition. For purposes of detecting an
end-of-utterance condition, the time frame of this audio signal may be assumed to
be after the start of speech.

As noted, other types of prosodic parameters and more traditional,
non-prosodic knowledge sources can also be used to detect an end-of-utterance
condition (although not so indicated in Figure 3). A technique for combining
multiple knowledge sources to make a decision is described in U.S. Patent no.
5,097,509 of Lennig, issued on March 17, 1992 ("Lennig"), which is incorporated
herein by reference. In accordance with the present invention, the technique
described by Lennig may be used to combine multiple prosodic knowledge
sources, or to combine one or more prosodic knowledge sources with one or more
non-prosodic knowledge sources, to detect an end-of-utterance condition. The
technique involves creating a histogram, based on training data, for each
knowledge source. Training data consists of both "positive" and "negative"
utterances. Positive utterances are defined as those utterances which meet the
criterion of interest (e.g., end-of-utterance), while negative utterances are defined
as those utterances which do not. Each knowledge source is represented as a
scalar value. The bin boundaries of each histogram partition the range of the
feature into a number of bins. These boundaries are determined empirically so
that there is enough resolution to distinguish useful differences in values of the
knowledge source but so that there is a sufficient amount of data in each bin. The
bins need not be of uniform width.
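
By way of illustration only, building the positive and negative histograms for one knowledge source can be sketched as follows. The function names and the representation of the bin boundaries as a sorted list of interior edges are assumptions made for this sketch; the boundary values themselves would be chosen empirically as described above.

```python
from bisect import bisect_right

def bin_index(value, boundaries):
    """Map a scalar to a bin. `boundaries` is a sorted list of interior bin
    edges, giving len(boundaries) + 1 bins whose widths need not be uniform."""
    return bisect_right(boundaries, value)

def build_histograms(positive_values, negative_values, boundaries):
    """Count the positive and negative training values that fall in each bin."""
    n_bins = len(boundaries) + 1
    pos = [0] * n_bins
    neg = [0] * n_bins
    for v in positive_values:
        pos[bin_index(v, boundaries)] += 1
    for v in negative_values:
        neg[bin_index(v, boundaries)] += 1
    return pos, neg
```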

It may be useful to smooth the histograms, particularly when there is
limited training data. One approach to doing so is "medians of three" smoothing,
described in J.W. Tukey, "Smoothing Sequences," Exploratory Data Analysis,
Addison-Wesley, 1977. In medians of three smoothing, starting at one end of the
histogram and processing each bin in order until reaching the other end, the count
of each bin is replaced by the median of the counts of that bin and the two adjacent
bins. The smoothing is applied separately to the positive and negative bin counts.
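
By way of illustration only, medians of three smoothing of one list of bin counts can be sketched as follows. Two details are assumptions of this sketch rather than statements of the text: the two end bins are left unchanged, and each median is taken over the original (unsmoothed) counts.

```python
def medians_of_three(counts):
    """Running median of width three over bin counts. End bins are left
    unchanged (an assumption; end handling is not specified in the text)."""
    smoothed = list(counts)
    for k in range(1, len(counts) - 1):
        # Median of the original counts of bin k and its two neighbours.
        smoothed[k] = sorted(counts[k - 1:k + 2])[1]
    return smoothed
```

The function would be called once on the positive counts and once on the negative counts.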

At run time, a given knowledge source (e.g., intonation) is measured. The
value of this knowledge source determines the histogram bin into which it falls.
Suppose that bin is bin number K. Let A represent the number of positive training
utterances that fell into bin K and let B represent the number of negative training
utterances that fell into bin K. A probability score of this knowledge source is
then computed as P_i = A/(A+B), where P_i represents the probability that the
criterion of interest is satisfied given the current value of this knowledge source.
The same process is used for each additional knowledge source. The probabilities
of the different knowledge sources are then combined to generate an overall
probability P as follows: P = (P_1**w_1)(P_2**w_2)(P_3**w_3)...(P_n**w_n), where
the "**" operator indicates exponentiation and w_1, w_2, w_3, etc. are
empirically-determined, non-negative weights that sum to one.
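
By way of illustration only, the per-source probability score A/(A+B) and the weighted combination can be sketched as follows. The function names are hypothetical, and the value returned for an empty bin is an assumption, since the text does not address that case.

```python
import math

def bin_probability(pos_count, neg_count):
    """Probability score A / (A + B) for the bin the measured value fell into."""
    total = pos_count + neg_count
    if total == 0:
        return 0.5  # assumption: an empty bin carries no evidence either way
    return pos_count / total

def combine(probabilities, weights):
    """Overall P as the product of each probability raised to its weight,
    with non-negative weights that sum to one."""
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    result = 1.0
    for p, w in zip(probabilities, weights):
        result *= p ** w
    return result
```

Because the weights sum to one, the result is a weighted geometric mean and therefore stays between the smallest and largest of the individual scores.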

Intonation of an utterance is one prosodic knowledge source that can be
useful in endpoint detection. Various techniques can be used to determine the
intonation. The intonation of an utterance is represented, at least in part, by the
change in fundamental frequency of the utterance over time. Hence, the
intonation of an utterance may be determined in the form of a pattern (an
"intonation pattern") indicating the change in fundamental frequency of the
utterance over time. In the English language, a generally decreasing fundamental
frequency is more indicative of an end-of-utterance condition than a generally
increasing fundamental frequency. Hence, a decline in fundamental frequency
may represent decreasing intonation, which may be evidence of an
end-of-utterance condition.

There are many possible approaches to mapping a declining fundamental
frequency pattern into a scalar feature, for use in the above-described histogram
approach. The intonation pattern may be, for example, a single computation
based on the difference in fundamental frequency between two frames of data, or
it may be based on multiple differences for three or more (potentially overlapping)
frames within a predetermined time range. For this purpose, it may be sufficient
to examine the most recent approximately 0.6 to 1.2 seconds or one to three
syllables of speech.

One specific approach involves computing the smoothed first difference of
the fundamental frequency. Let F(n) represent the fundamental frequency, F0, of
frame n. Let F'(n) = F(n) - F(n-1) represent the first difference of F(n). Let f(n) =
aF'(n) - (1-a)f(n-1), where 0<a<1, represent the smoothed first difference of F(n).
The value of "a" is tuned empirically so that f(n) becomes as negative as possible
when the F0 pattern declines at the end of an utterance. Use f(n) as an input
feature to the histogram method. Note that when F(n) is undefined because it is in
an unvoiced segment of speech, F(n) may be defined as F(n-1).
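
By way of illustration only, a smoothed first difference of an F0 track can be sketched as follows. Several points are assumptions of this sketch: the recursion uses the conventional exponential-smoothing form with a plus sign between the two terms, f(n) = a F'(n) + (1 - a) f(n-1), which should be checked against the formula as printed; the recursion starts from zero; and unvoiced frames are represented by None.

```python
def smoothed_first_difference(f0_track, a):
    """Return f(n) for each frame. `f0_track` holds F(n) in Hz, with None for
    unvoiced frames; an unvoiced frame reuses F(n-1)."""
    if not 0.0 < a < 1.0:
        raise ValueError("a must satisfy 0 < a < 1")
    out = []
    prev_f0 = None  # F(n-1), after holding over unvoiced frames
    f = 0.0         # f(n-1); starting from zero is an assumption
    for f0 in f0_track:
        if f0 is None:
            f0 = prev_f0         # hold the last defined value
        if f0 is None or prev_f0 is None:
            diff = 0.0           # no difference available yet
        else:
            diff = f0 - prev_f0  # F'(n) = F(n) - F(n-1)
        f = a * diff + (1.0 - a) * f
        out.append(f)
        prev_f0 = f0
    return out
```

With this form, a sustained fall in F0 drives f(n) steadily negative, while an isolated frame-to-frame jump is damped.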

Other approaches could capture more information about the time evolution
of the fundamental frequency pattern using techniques such as Hidden Markov
Models, where the parameter f(n) is the observation parameter.

The intonation pattern may additionally (or alternatively) include the
relationship between the current fundamental frequency and the fundamental
frequency range of the speaker. For example, a drop in fundamental frequency to
a value that is near the low end of the fundamental frequency range of the speaker
may suggest an end-of-utterance condition. It may be desirable to treat as two
distinct knowledge sources the change in fundamental frequency over time and
the relationship between the current fundamental frequency and the speaker's
fundamental frequency range. In that case, these two intonation-based knowledge
sources may be combined using the above-described histogram approach, for
purposes of detecting an end-of-utterance condition.

To apply the histogram approach to the latter-mentioned knowledge
source, the low end of the speaker's fundamental frequency range is computed as
a scalar. One way of doing this is simply to use the minimum observed
fundamental frequency for the speaker. The fundamental frequency range of the
speaker may be determined adaptively from utterances of the speaker earlier in a
dialog. In one embodiment, the system asks the speaker a question specifically
designed to elicit a response conducive to determining the low end of the
speaker's fundamental frequency range. This may be a simple yes/no question,
the response of which will normally contain the word "yes" or "no" with a falling
intonation approaching the low end of the speaker's fundamental frequency
range. The fundamental frequency of the vowel of the speaker's response may be
used as an initial estimate of the low end of the speaker's fundamental frequency
range. However this low end of the fundamental frequency range is estimated,
designate it as C. Hence, the value input to the fundamental frequency range
histogram may be computed as F0 - C.
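
By way of illustration only, tracking the low end C of the speaker's range as the minimum observed F0, and computing the feature F0 - C, can be sketched as follows. The class name and the optional initial estimate (for example, from a yes/no response) are hypothetical.

```python
class F0RangeTracker:
    """Tracks C, the low end of a speaker's F0 range, as the minimum observed F0."""

    def __init__(self, initial_low=None):
        self.low = initial_low  # C, in Hz; None until an estimate is available

    def update(self, f0):
        """Lower C if a voiced frame falls below the current estimate."""
        if f0 is not None and (self.low is None or f0 < self.low):
            self.low = f0

    def feature(self, f0):
        """Return F0 - C, or None if either value is unavailable."""
        if self.low is None or f0 is None:
            return None
        return f0 - self.low
```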

Any of various knowledge sources may be used as input in the histogram
technique described above, to compute the probability P. These knowledge
sources may include, for example, any one or more of the following: silence
duration, silence duration normalized for speaking rate, f(n) as defined above,
F0 - C as defined above, final syllable duration, final syllable duration normalized
for phonemic content, final syllable duration normalized for stress, or final syllable
duration normalized for a combination of the foregoing parameters.

Various non-histogram based approaches can also be used to perform
prosody based endpoint detection. Figure 4 illustrates a non-histogram based
approach for prosody based determination of an end-of-utterance condition,
according to one embodiment, which may be implemented in the endpoint
detector 5. Initially, the next frame of speech is input to the endpoint detector 5 at
401. It is next determined at 402 whether the log energy (the logarithm of the
energy of the speech signal) is below a predetermined energy threshold level.
This threshold level may be set dynamically and adaptively. The specific value of
the threshold level may also depend on various factors, such as the specific
application of the system and desired system performance, and is therefore not
provided herein. If the log energy is not below the threshold level, the process
repeats from 401. If the log energy is below the threshold level, then at 403 the
intonation pattern of the utterance is determined, which may be done as described
above.

Next, at 404 the intonation pattern is referenced against an intonation
model to determine a first preliminary probability P_1 that the end-of-utterance
condition has been reached, given that intonation pattern. The intonation model
may be one of prosody models 3-1 through 3-N in Figure 1 and may be in the form
of a histogram based on training data, such as described above. Other examples of
the format of the intonation model are described below. In essence, this is a
determination of whether the intonation pattern is suggestive of an
end-of-utterance condition. As noted above, a generally decreasing intonation may
suggest an end-of-utterance condition. Again, it may be sufficient to examine the
last approximately 0.6 to 1.2 seconds or one to three syllables of speech for this
purpose.

As noted above, other intonation-based parameters (e.g., the relationship
between the fundamental frequency and the speaker's fundamental frequency
range) may be represented in the intonation model. Alternatively, such other
parameters may be treated as separate knowledge sources and referenced against
separate intonation models to obtain separate probability values.

Referring still to Figure 4, at 405 the amount of time T_1 which the speech
signal has remained below the energy threshold level is computed. This amount
of time T_1 is then referenced at 406 against a model of elapsed time to determine a
second preliminary probability P_2 that the end-of-utterance has been reached,
given the pause duration T_1. At 407, the normalized, relative duration of the
final syllable of the utterance is computed. Although the duration of the final
syllable of the utterance cannot actually be known before an end-of-utterance
condition has been identified, this computation 407 may be based on the
temporary assumption (i.e., only for purposes of this computation) that an
end-of-utterance condition has occurred. Techniques for automatically determining
the duration of a syllable of an utterance are well-known. Once computed, the
duration is then referenced at 408 against a syllable duration model (e.g.,
another one of prosody models 3-1 through 3-N) to determine a third preliminary
probability P_3 of end-of-utterance, given the normalized relative duration of the
last syllable.

At 409, the overall probability P of end-of-utterance is computed as a
function of P_1, P_2 and P_3, which may be, for example, a geometrically weighted
average of P_1, P_2 and P_3. In this computation, each probability value P_1, P_2
and P_3 is raised to a power, where the three powers sum to one. At 410,
the overall probability P is compared against a threshold probability level P_th. If P
exceeds the threshold probability P_th at 410, then an end-of-utterance is
determined to have occurred at 411, and the process then repeats from 401.
Otherwise, an end-of-utterance is not yet identified, and the process repeats from
401. The threshold probability P_th, as well as the specific function used to
compute the overall probability P, can depend upon various factors, such as the
particular application of the system, the desired performance, etc.
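
By way of illustration only, one pass through operations 402 to 411 of Figure 4 for a single frame can be sketched as follows. The three models are passed in as callables that return a probability. The function name, the equal default weights, and the default threshold of 0.5 are hypothetical values chosen for the sketch, not values given in this description.

```python
def figure4_decision(log_energy, energy_threshold,
                     intonation_feature, pause_duration, syllable_duration,
                     intonation_model, pause_model, syllable_model,
                     weights=(1 / 3, 1 / 3, 1 / 3), p_threshold=0.5):
    """Return True if an end-of-utterance is identified for this frame."""
    if log_energy >= energy_threshold:         # 402: energy not below threshold
        return False
    p1 = intonation_model(intonation_feature)  # 404: first preliminary probability
    p2 = pause_model(pause_duration)           # 406: second preliminary probability
    p3 = syllable_model(syllable_duration)     # 408: third preliminary probability
    w1, w2, w3 = weights
    p = (p1 ** w1) * (p2 ** w2) * (p3 ** w3)   # 409: weighted geometric average
    return p > p_threshold                     # 410 and 411
```

Any of the model forms discussed in this description (histogram, Gaussian, look-up table) could be wrapped as one of the callables.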

Many variations upon this process are possible, as will be recognized by
those skilled in the art. For example, the order of the operations mentioned above
may be changed for different embodiments.

Referring again to operation 404 in Figure 4, the intonation model may
have any of a variety of possible forms, an example of which is a histogram based
on training data. In yet another approach, the intonation model may be a
regression model or a Gaussian distribution of training data, with an estimated
mean and variance, against which the input data is compared to assign the
probability values P_1. Parametric approaches such as these can optionally be
implemented using a Hidden Markov Model to capture information about the
time evolution of the intonation pattern.

As an example of a non-parametric approach, the intonation model may be
a prototype function of declining fundamental frequency over time (i.e.,
representing known end-of-utterance conditions). Thus, the operation 404 may be
accomplished by computing the correlation between the observed intonation
pattern and the prototype function. In this approach, it may be useful to express
the prototype function and the observed intonation values as percentage increases
or decreases in fundamental frequency, rather than as absolute values.
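
By way of illustration only, comparing an observed F0 pattern with a prototype can be sketched as follows. This sketch makes two assumptions that go beyond the text: percentage changes are measured relative to the first frame of the analysis window, and "correlation" is taken to be the zero-lag normalized correlation (cosine similarity) of the two percentage curves.

```python
import math

def percent_change_from_start(f0_values):
    """Express each F0 value as a percentage change relative to the first value."""
    start = f0_values[0]
    return [100.0 * (v - start) / start for v in f0_values]

def normalized_correlation(x, y):
    """Zero-lag normalized correlation of two equal-length sequences;
    returns 0.0 if either sequence is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)

def match_prototype(observed_f0, prototype_percent):
    """Score in [-1, 1]; values near 1 mean the observed pattern follows the prototype."""
    return normalized_correlation(percent_change_from_start(observed_f0),
                                  prototype_percent)
```

Working in percentages makes the score independent of the speaker's absolute pitch, so a fall from 200 Hz to 170 Hz and a fall from 100 Hz to 85 Hz score the same.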

As yet another example, the intonation model may be a simple look-up
table of intonation patterns (i.e., functions or values) vs. probability values P_1.
Interpolation may be used to map input values that do not exactly match a value
in the table.
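
By way of illustration only, a look-up table with linear interpolation can be sketched as follows for a scalar intonation feature. The table layout, the choice of linear interpolation, and the clamping of out-of-range inputs to the nearest entry are assumptions of this sketch.

```python
from bisect import bisect_left

def lookup_probability(value, table):
    """`table` is a list of (feature_value, probability) pairs sorted by
    feature value. Inputs between entries are linearly interpolated; inputs
    outside the table are clamped to the nearest entry (an assumption)."""
    xs = [x for x, _ in table]
    if value <= xs[0]:
        return table[0][1]
    if value >= xs[-1]:
        return table[-1][1]
    i = bisect_left(xs, value)  # xs[i - 1] < value <= xs[i]
    x0, p0 = table[i - 1]
    x1, p1 = table[i]
    return p0 + (p1 - p0) * (value - x0) / (x1 - x0)
```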

Referring to operation 406 in Figure 4, the model of elapsed time (during
which the speech has exhibited low energy) may also include a histogram
constructed from training data, or another format such as described above. Since
different speech recognition grammars may give rise to different post-speech
timeout parameters, it may be useful to introduce an additive bias that is
adjustable through tuning, to the computation of probability P_2. This additive bias
may be subtracted from the observed length of time T_1 of low energy speech
before using the result to compute probability P_2 using the histogram approach.
This approach would provide the system designer with the ability to bias the
system to require longer silences to conclude an end-of-utterance has occurred.

Referring to operation 408 in Figure 4, the syllable duration model may
have essentially any form that is suitable for this purpose, such as a histogram or
other format described above.

Figures 5A and 5B collectively represent another embodiment of the
prosody based endpoint detection technique. The processes of Figures 5A and 5B
may be performed concurrently. The process of Figure 5A is for determining a
threshold time value T_th, which is used in the process of Figure 5B to identify an
end-of-utterance condition. Specifically, the threshold time value T_th determines
how long the endpoint detector will wait, in response to detecting the input
signal's log energy has fallen below a threshold level, before determining an
end-of-utterance has occurred.

Referring first to Figure 5A, initially the next frame of speech representing
an utterance is input at 501. At 502, the intonation pattern of the utterance is
determined, such as in the manner described above. At 503, a determination is
made of whether the intonation pattern is generally suggestive of (e.g., in terms of
probability) an end-of-utterance condition. This determination 503 may be made
in the manner described above. If the intonation of the utterance is determined at
503 to be suggestive of an end-of-utterance condition, then at 505 the threshold
time value T_th is set equal to a predetermined time value y. If not, then at 504 the
threshold time value T_th is set equal to a predetermined time value x, which is
larger than (represents longer duration than) time value y. The specific values for
x and y can depend upon various factors, such as the particular application of the
system, the desired performance, etc.

Referring now to Figure 5B, a timer variable T_4 is initialized to zero at 510,
and at 511 the next frame of speech is input. At 512, a determination is made of
whether the log energy of the speech has dropped below the threshold level. If
not, T_4 is reset to zero at 516, and the process then repeats from 511. If the signal
has dropped below the threshold level, then at 513 T_4 is incremented. Next, at 514
T_4 is compared to the threshold time value T_th determined in the process of Figure
5A. If T_4 exceeds T_th, then at 515 an end-of-utterance condition is identified, and
the process repeats from 510. Otherwise, an end-of-utterance condition is not yet
identified, and the process repeats from 511. Many variations upon these
processes are possible without altering the basic approach, such as changing the
ordering of the above-noted operations.
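
By way of illustration only, the processes of Figures 5A and 5B can be sketched together as a single per-frame routine. The class name, the measurement of time in frames, and the initial value of the threshold time are hypothetical. The intonation decision of operation 503 is assumed to be supplied by the caller as a boolean, for example held over from the most recent voiced speech.

```python
class AdaptiveTimeoutEndpointer:
    """Sketch of Figures 5A and 5B, processed frame by frame."""

    def __init__(self, energy_threshold, x_frames, y_frames):
        if y_frames >= x_frames:
            raise ValueError("x must represent a longer duration than y")
        self.energy_threshold = energy_threshold
        self.x = x_frames     # longer wait: intonation does not suggest an ending
        self.y = y_frames     # shorter wait: intonation suggests an ending
        self.t_th = x_frames  # threshold time value (initial value is an assumption)
        self.t4 = 0           # timer variable, in frames (510)

    def process_frame(self, log_energy, intonation_suggests_end):
        """Return True when an end-of-utterance condition is identified."""
        # Figure 5A (503 to 505): choose the wait time from the intonation.
        self.t_th = self.y if intonation_suggests_end else self.x
        # Figure 5B (512 to 516): count consecutive low-energy frames.
        if log_energy >= self.energy_threshold:
            self.t4 = 0        # 516: energy not below threshold, reset timer
            return False
        self.t4 += 1           # 513
        if self.t4 > self.t_th:  # 514
            self.t4 = 0        # 515, then restart from 510
            return True
        return False
```

With x set to 5 frames and y to 2 frames, a falling intonation followed by silence is endpointed after 3 low-energy frames, while silence after a non-final intonation requires 6.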

Thus, a method and apparatus for detecting endpoints of speech using
prosody have been described. Although the present invention has been described
with reference to specific exemplary embodiments, it will be evident that various
modifications and changes may be made to these embodiments without departing
from the broader spirit and scope of the invention as set forth in the claims.
Accordingly, the specification and drawings are to be regarded in an illustrative
sense rather than a restrictive sense.

CLAIMS

What is claimed is: 

1. A method comprising:

inputting speech representing an utterance and having an intonation; and

identifying an endpoint of the utterance based on the intonation.

2. A method as recited in claim 1, wherein said identifying an endpoint of the
utterance based on the intonation comprises comparing the intonation with an
intonation model.

3. A method as recited in claim 1, further comprising determining the intonation
by computing the fundamental frequency of the utterance.

4. A method as recited in claim 3, wherein said determining the intonation
comprises using an intonation model to determine the intonation.

5. A method as recited in claim 1, wherein said identifying the endpoint of the
utterance comprises identifying the endpoint of the utterance based on a plurality
of knowledge sources, wherein one of the knowledge sources is intonation,
including referencing the input speech against a histogram based on training data
for each of the knowledge sources.

6. A method as recited in claim 1, further comprising:

determining a period of time that has elapsed since the speech dropped
below a threshold value; and

wherein said identifying an endpoint of the utterance comprises identifying
the endpoint of the utterance further based on the period of time.

7. A method as recited in claim 1, wherein said identifying an endpoint of the
utterance comprises identifying the endpoint of the utterance further based on a
length of time for which an energy value of the speech has remained below a
predetermined energy value.

8. A method as recited in claim 7, wherein said identifying an endpoint of the
utterance further comprises identifying the endpoint of the utterance based on the
duration of the final syllable of the utterance.

9. A method of operating an endpoint detector, the method comprising:

inputting speech representing an utterance, the utterance having an
intonation; and

comparing the intonation of the utterance with an intonation model;

determining a probability based on a result of said comparing; and

identifying an endpoint of the utterance based on the probability.

10. A method as recited in claim 9, further comprising determining the intonation
of the utterance as a function of the fundamental frequency of the utterance.

11. A method as recited in claim 9, further comprising:

determining a period of time that has elapsed since a value of the speech
dropped below a threshold value; and

wherein said identifying an endpoint of the utterance comprises identifying
the endpoint of the utterance further based on the period of time.

12. A method as recited in claim 9, wherein said identifying an endpoint of the
utterance comprises identifying the endpoint of the utterance further based on the
duration of the final syllable of the utterance.

13. A method as recited in claim 12, wherein said identifying an endpoint of the
utterance comprises identifying the endpoint of the utterance further based on a
period of time for which an energy value of the speech has remained below a
threshold value.

14. A method of operating an endpoint detector for speech recognition, the
method comprising:

inputting speech representing an utterance;

determining that a value of the speech has dropped below a threshold value;

computing an intonation of the utterance;

referencing the intonation of the utterance against an intonation model to
determine a first end-of-utterance probability;

determining a period of time that has elapsed since the value of the speech
dropped below the threshold value;

referencing the period of time against an elapsed time model to determine a
second end-of-utterance probability;

computing an overall end-of-utterance probability as a function of the first
and second end-of-utterance probabilities; and

determining whether an end-of-utterance has occurred based on the overall
end-of-utterance probability.

15. A method as recited in claim 14, wherein said computing an intonation of the
utterance comprises computing an intonation of the utterance by determining the
fundamental frequency of the utterance as a function of time.

16. A method as recited in claim 15, further comprising:

determining a duration of a final syllable of the utterance; and

referencing the duration of the final syllable against a syllable duration
model to determine a third end-of-utterance probability;

wherein said computing an overall end-of-utterance probability comprises
computing the overall end-of-utterance probability as a function of the first,
second, and third end-of-utterance probabilities.

17. A method of operating an endpoint detector for speech recognition, the
method comprising:

inputting speech representing an utterance;

computing an intonation of the utterance;

referencing the intonation of the utterance against an intonation model to
determine a first end-of-utterance probability;

determining a duration of a final syllable of the utterance;

referencing the duration of the final syllable against a syllable duration
model to determine a second end-of-utterance probability;

computing an overall end-of-utterance probability as a function of the first
and second end-of-utterance probabilities; and

determining whether an end-of-utterance has occurred based on the overall
end-of-utterance probability.

18. A method as recited in claim 17, wherein said computing an intonation of the
utterance comprises computing an intonation of the utterance by determining the
fundamental frequency of the utterance as a function of time.

19. A method as recited in claim 17, further comprising:

determining that a value of the speech has dropped below a threshold value;

determining a period of time that has elapsed since the value of the speech
dropped below the threshold value; and

referencing the period of time against an elapsed time model to determine a
third end-of-utterance probability;

wherein said computing an overall end-of-utterance probability comprises
computing the overall end-of-utterance probability as a function of the first,
second, and third end-of-utterance probabilities.

20. A method of operating an endpoint detector for speech recognition, the
method comprising:

inputting speech representing an utterance, the utterance having a
time-varying fundamental frequency;

determining that a value of the speech has dropped below a threshold value;

computing an intonation of the utterance by determining the fundamental
frequency of the utterance as a function of time;

referencing the intonation of the utterance against an intonation model to
determine a first end-of-utterance probability;

determining a period of time that has elapsed since a value of the speech
dropped below the threshold value;

referencing the period of time against an elapsed time model to determine a
second end-of-utterance probability;

determining a duration of a final syllable of the utterance;

referencing the duration of the final syllable against a syllable duration
model to determine a third end-of-utterance probability;

computing an overall end-of-utterance probability as a function of the first,
second, and third end-of-utterance probabilities; and

determining whether an end-of-utterance has occurred by comparing the
overall end-of-utterance probability to a threshold probability.

21. A method of operating an endpoint detector for speech recognition, the
method comprising:

inputting speech representing an utterance;

determining an intonation of the utterance;

if the intonation of the utterance is determined to be generally decreasing,
then setting a threshold time period equal to a first time value;

if the intonation of the utterance is determined not to be generally
decreasing, then setting the threshold time period equal to a second time value
larger than the first time value; and

identifying an endpoint of the utterance based on the threshold time period.

22. A method as recited in claim 21, wherein said using the threshold time period
to identify an endpoint of the utterance comprises using the threshold time period
to identify an endpoint of the utterance by determining that an endpoint of the
utterance has occurred if an energy value of the speech remains below a
predetermined value for the threshold time period.



23. A method as recited in claim 21, wherein said determining an intonation of the
utterance comprises using an intonation model.

24. A method of operating an endpoint detector for speech recognition, the
method comprising:

inputting speech representing an utterance, the utterance having a
time-varying fundamental frequency;

determining an intonation of the utterance by

    computing the intonation as the fundamental frequency of the
    utterance as a function of time, and

    referencing the intonation against an intonation model to determine
    the intonation of the utterance;

if the intonation of the utterance is determined to be generally decreasing,
then setting a threshold time period equal to a first time value;

if the intonation of the utterance is determined not to be generally
decreasing, then setting the threshold time period equal to a second time value
larger than the first time value; and

using the threshold time period to identify an endpoint of the utterance, by
determining that an endpoint of the utterance has occurred if the speech remains
below a predetermined value for a length of time equal to the threshold time
period.

25. A machine-readable program storage medium tangibly embodying a sequence
of instructions executable by a machine to perform a method for endpoint
detection, the method comprising:

inputting speech representing an utterance, the utterance having an
intonation; and

identifying an endpoint of the utterance based on the intonation of the
utterance.

26. A machine-readable program storage medium as recited in claim 25, wherein
said using the intonation of the utterance in identifying an endpoint of the
utterance comprises comparing the intonation of the utterance with an intonation
model.

27. A machine-readable program storage medium as recited in claim 25, wherein
the method further comprises determining the intonation of the utterance.

28. A machine-readable program storage medium as recited in claim 27, wherein
said determining the intonation of the utterance comprises computing the
fundamental frequency of the utterance.

29. A machine-readable program storage medium as recited in claim 27, wherein
said determining the intonation of the utterance comprises using an intonation
model to determine the intonation of the utterance.

30. A machine-readable program storage medium as recited in claim 25, wherein
the method further comprises:

determining a period of time for which an energy value of the speech has
been below a threshold value; and

wherein said identifying an endpoint of the utterance comprises identifying
the endpoint of the utterance further based on the period of time.

31. A machine-readable program storage medium as recited in claim 25, wherein
the method further comprises:

determining a duration of a final syllable of the utterance; and

wherein said identifying an endpoint of the utterance comprises identifying
the endpoint of the utterance further based on the duration of a final syllable of
the utterance.

32. A machine-readable program storage medium as recited in claim 31, wherein
the method further comprises:

determining a period of time that has elapsed since a value of the speech
dropped below a threshold value; and

wherein said identifying an endpoint of the utterance comprises identifying
the endpoint of the utterance further based on the period of time.

33. An endpoint detector comprising:

means for inputting speech representing an utterance, the utterance having
an intonation; and

means for identifying an endpoint of the utterance based on the intonation
of the utterance.

34. An endpoint detector as recited in claim 33, wherein said means for using the
intonation of the utterance in identifying an endpoint of the utterance comprises
means for comparing the intonation of the utterance with an intonation model.



35. An endpoint detector as recited in claim 33, further comprising means for
determining the intonation of the utterance.

36. An endpoint detector as recited in claim 35, wherein said means for
determining the intonation of the utterance comprises means for computing the
fundamental frequency of the utterance.

37. An endpoint detector as recited in claim 35, wherein said means for
determining the intonation of the utterance comprises means for using an
intonation model to determine the intonation of the utterance.



38. An endpoint detector as recited in claim 33, further comprising:

means for determining a period of time that has elapsed since a value of the
speech dropped below a threshold value; and

wherein said means for identifying an endpoint of the utterance comprises
means for identifying the endpoint of the utterance further based on the period of
time.



39. An endpoint detector as recited in claim 33, further comprising:

means for determining a duration of a final syllable of the utterance; and

wherein said means for identifying an endpoint of the utterance comprises
means for identifying the endpoint of the utterance further based on the duration
of a final syllable of the utterance.

40. An endpoint detector as recited in claim 39, further comprising:

means for determining a period of time that has elapsed since a value of the
speech dropped below a threshold value; and

wherein said means for identifying an endpoint of the utterance comprises
means for identifying the endpoint of the utterance further based on the period of
time.



41. An apparatus for performing endpoint detection comprising:

means for inputting speech representing an utterance, the utterance having
a time-varying fundamental frequency;

means for determining that a value of the speech has dropped below a
threshold value;

means for computing an intonation of the utterance by determining the
fundamental frequency of the utterance as a function of time;

means for referencing the intonation of the utterance against an intonation
model to determine a first end-of-utterance probability;

means for determining a period of time that has elapsed since the speech
dropped below the threshold value;

means for referencing the period of time against an elapsed time model to
determine a second end-of-utterance probability;

means for referencing the duration of the final syllable of the utterance
against a syllable duration model to determine a third end-of-utterance
probability;

means for computing an overall end-of-utterance probability as a function
of the first, second, and third end-of-utterance probabilities; and

means for determining whether an end-of-utterance has occurred by
comparing the overall end-of-utterance probability to a threshold probability.

ABSTRACT OF THE DISCLOSURE

A method and apparatus are provided for performing prosody based 
endpoint detection of speech in a speech recognition system. Input speech 
represents an utterance, which has an intonation pattern. An end-of-utterance 
condition is identified based on prosodic parameters of the utterance, such as the 
intonation pattern and the duration of the final syllable of the utterance, as well as 
non-prosodic parameters, such as the log energy of the speech.